
fix: address unresolved review comments from PyPDF File Processor PR#4743#5173

Merged
cdoern merged 47 commits into llamastack:main from RobuRishabh:RHAIENG-1823-Address-Unresolved-Reviews
Apr 9, 2026

Conversation

@RobuRishabh
Contributor

What does this PR do?

Addresses remaining unresolved review comments from PR #4743 (PyPDF File Processor integration) to ensure the file processing pipeline is consistent, correctly typed, and aligned with API expectations.

Key changes:

  • Remove legacy chunking fallback: Eliminate the _legacy_chunk_file method and all fallback paths from OpenAIVectorStoreMixin. The system now raises a clear RuntimeError if file_processor_api is not configured, instead of silently degrading to legacy inline parsing.
  • Wire file_processor_api through all vector_io providers: Add Api.file_processors to optional_api_dependencies in the registry, pass it through all 12 factory functions, and accept/forward it in all 9 adapter constructors.
  • Make files_api required in PyPDF constructors: Remove the default None from both PyPDFFileProcessorAdapter and PyPDFFileProcessor, and use deps[Api.files] (bracket access) in the factory to fail fast if somehow missing.
  • Bounded file reads: Implement chunked reading for direct uploads to cap memory usage, and add a size check on the file_id retrieval path against max_file_size_bytes.
  • Clear error for missing file_id: Wrap openai_retrieve_file in a try/except that surfaces a ValueError("File with id '...' not found"), with a new test covering this case.
  • Conditional text cleaning: Make the .strip() whitespace-only page filter conditional on the clean_text config setting.
  • Dead code cleanup: Remove unused file_processor_api field from VectorStoreWithIndex and the now-unused make_overlapped_chunks import from the mixin.
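The fail-fast behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual llama-stack code: the class shape, method names, and error message are assumptions.

```python
class OpenAIVectorStoreMixin:
    """Sketch of the fail-fast behavior: no silent fallback to legacy
    inline parsing when the file processor API is absent."""

    def __init__(self, file_processor_api=None):
        self.file_processor_api = file_processor_api

    def process_file(self, file_bytes, filename):
        if self.file_processor_api is None:
            # Raise loudly instead of degrading to a legacy chunking path.
            raise RuntimeError(
                "file_processor_api is not configured; register a "
                "file_processors provider (e.g. inline::pypdf)"
            )
        return self.file_processor_api.process(file_bytes, filename)
```

The same idea applies in the factory: bracket access like `deps[Api.files]` raises `KeyError` immediately if the dependency is missing, rather than passing `None` downstream.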

Closes #4743

Test Plan

Automated tests

1. Unit tests (mixin + vector_io)

KMP_DUPLICATE_LIB_OK=TRUE uv run --group unit pytest -sv tests/unit/

All test_contextual_retrieval.py tests (16) and all test_vector_store_config_registration.py tests pass — these exercise the refactored OpenAIVectorStoreMixin.

2. PyPDF file processor tests (20/20 pass)

uv run --group test pytest -sv tests/integration/file_processors/test_pypdf_processor.py

3. Full integration suite (replay mode)

uv run --group test pytest -sv tests/integration/ --stack-config=starter

Result: 4 failed, 54 passed, 639 skipped, 1 xfailed

All 4 failures are pre-existing and unrelated:

  • test_safety_with_image — Pydantic schema mismatch (type: 'image' vs 'image_url')
  • test_starter_distribution_config_loads_and_resolves / test_postgres_demo_distribution_config_loads — relative path FileNotFoundError
  • test_mcp_tools_list_with_schemas — no local MCP server (Connection refused)

No regressions in vector_io, file_search, or ingestion workflows.

Manual E2E verification (with starter distro)

export OLLAMA_URL="http://localhost:11434/v1"
llama stack run starter

1. Verify route is registered:

curl -sS http://localhost:8321/v1/inspect/routes \
  | jq -r '.data[] | select(.route|test("file-processors";"i"))'

Expected:

{
  "route": "/v1alpha/file-processors/process",
  "method": "POST",
  "provider_types": [
    "inline::pypdf"
  ]
}

2. Verify OpenAPI contains the endpoint:

curl -sS http://localhost:8321/openapi.json | rg -n "/v1alpha/file-processors/process|file-processors" -i

3. Direct file upload:

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file=@/path/to/sample.pdf" | jq

Expected: chunks response with metadata.processor = "pypdf".

4. Via file_id:

FILE_ID="$(curl -sS -X POST http://localhost:8321/v1/files \
  -F "file=@/path/to/sample.pdf" -F "purpose=assistants" | jq -r .id)"

curl -sS -X POST http://localhost:8321/v1alpha/file-processors/process \
  -F "file_id=${FILE_ID}" | jq

Expected: chunks response with metadata.processor = "pypdf" and file_id in chunk metadata.

@meta-cla

meta-cla bot commented Mar 17, 2026

Hi @RobuRishabh!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@meta-cla

meta-cla bot commented Mar 17, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Mar 17, 2026
…lamastack#4743

- Remove legacy chunking fallback and _legacy_chunk_file from vector store mixin;
  raise RuntimeError if FileProcessor API is not configured
- Wire file_processor_api through all vector_io providers (registry, factories,
  adapter constructors)
- Make files_api required in PyPDF adapter and processor constructors
- Implement chunked file reading (64KB) for direct uploads to cap memory usage
- Add size check on file_id retrieval path against max_file_size_bytes
- Wrap openai_retrieve_file in try/except to surface clear ValueError for
  missing file_id, with test coverage
- Make .strip() page filter conditional on clean_text config
- Remove unused file_processor_api field from VectorStoreWithIndex
- Clean up dead imports (make_overlapped_chunks) from mixin
- fixed linters, formats using pre-commit checks
- fixed pypdf to handle .txt files

Signed-off-by: roburishabh <roburishabh@outlook.com>
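The 64 KB bounded read described in this commit could look roughly like the sketch below; `read_bounded`, the constant name, and the error message are assumptions for illustration, not the actual implementation.

```python
import io

CHUNK_SIZE = 64 * 1024  # 64 KiB, the chunk size mentioned above


def read_bounded(stream, max_file_size_bytes):
    """Read a stream in fixed-size chunks, aborting as soon as the size cap
    is exceeded instead of buffering the whole upload with one read()."""
    buf = bytearray()
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            return bytes(buf)
        buf.extend(chunk)
        if len(buf) > max_file_size_bytes:
            raise ValueError(
                f"File exceeds maximum size of {max_file_size_bytes} bytes"
            )
```

On the file_id path, the same max_file_size_bytes check can be applied to the retrieved file's reported size before its content is loaded.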
@RobuRishabh RobuRishabh force-pushed the RHAIENG-1823-Address-Unresolved-Reviews branch from 43b1105 to 1eb5352 on March 17, 2026 17:36
@RobuRishabh RobuRishabh requested a review from cdoern March 17, 2026 19:52
Collaborator

@cdoern cdoern left a comment


I think you are missing additions to VectorIORouter, so all the tests are failing because the args to the router are mismatched.

Signed-off-by: roburishabh <roburishabh@outlook.com>
  - Add MIME type parsing safety check to prevent IndexError
  - Document chunked file reading approach and rationale
  - Make file_processors a hard dependency for all vector_io providers
  - Add unit test for missing file_processor_api error handling

Signed-off-by: roburishabh <roburishabh@outlook.com>
@RobuRishabh RobuRishabh requested a review from cdoern March 19, 2026 17:08
@mergify
Contributor

mergify bot commented Mar 25, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @RobuRishabh please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2026
Remove duplicate legacy chunking code that was incorrectly merged
alongside the new FileProcessor API path, and fix incomplete
RuntimeError syntax. Also remove unused make_overlapped_chunks import

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Recordings committed successfully

Recordings from the integration tests have been committed to this PR.

View commit workflow

@cdoern
Collaborator

cdoern commented Apr 7, 2026

The replay failures are not related to this PR but they block it from merging. All file_search tests that need to be re-recorded fail during replay because request hashes don't match the recordings.

From the server log:

RuntimeError: Recording not found for request hash: bfa38c097b0589ca...
Model: gpt-4o | Request: POST https://api.openai.com/v1/v1/chat/completions

The file_search tests generate new vector store IDs, file IDs, and chunk scores on each run. These end up in the chat completion request body, changing the hash so the replay system can't find the matching recording. The API recorder needs to normalize these non-deterministic fields before hashing for this PR to pass CI.
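One plausible shape for that normalization — scrub UUID-formatted IDs from the request body before hashing, so re-generated vector store and file IDs no longer change the recording key — is sketched below. The function name and placeholder are assumptions; the recorder's real implementation may differ (and would also need to handle non-deterministic chunk scores).

```python
import hashlib
import re

# 8-4-4-4-12 hex digits: the UUID shape used by vector store / document IDs.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def normalized_request_hash(body: str) -> str:
    """Replace non-deterministic IDs with a fixed placeholder so two runs of
    the same test hash to the same recording key."""
    canonical = UUID_RE.sub("<uuid>", body)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this in place, two requests that differ only in generated IDs map to the same recording.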

@cdoern
Collaborator

cdoern commented Apr 7, 2026

@franciscojavierarceo @alinaryan @RobuRishabh ^^

@cdoern
Collaborator

cdoern commented Apr 7, 2026

that is my best guess at least.

RobuRishabh and others added 15 commits April 7, 2026 15:21
… hashing

Signed-off-by: roburishabh <roburishabh@outlook.com>
Signed-off-by: roburishabh <roburishabh@outlook.com>
…l results

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
…ool results

The patterns match the UUID format (8-4-4-4-12 hex digits), since the document IDs
in the actual tool results are UUIDs, not file IDs.

Updated OpenAI test recordings with new normalization (16 recordings).
Deleted Azure test recordings (85 recordings) - these will be regenerated
by CI with the new normalization when Azure integration tests run.

Verified locally with OpenAI setup in replay mode - 16/16 tests pass.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Update test counts in provider matrix to reflect new Azure file_search recordings.

Signed-off-by: roburishabh <roburishabh@outlook.com>
Co-Authored-By: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@RobuRishabh RobuRishabh requested a review from cdoern April 9, 2026 13:56
Collaborator

@cdoern cdoern left a comment


I really feel like something must be wrong here since we are adding 157,000 lines, most of which are removing, moving, and re-adding recordings. I don't want to stall this further, but in the future I really encourage PRs with fewer commits that are tested locally before opening the PR, if possible. Thank you!

@franciscojavierarceo
Collaborator

I really feel like something must be wrong here since we are adding 157,000 lines..... most of which are removing, moving, and re-adding recordings. I dont want to stall this further but I really encourage PRs with less commits and that are tested locally before opening the PR in the future if possible. Thank you!

are you saying this wasn't tested locally?

@RobuRishabh I believe you did test this locally, right?

@cdoern
Collaborator

cdoern commented Apr 9, 2026

@franciscojavierarceo I am saying that there were a dozen or so commits running pre-commit and re-pushing for integration test failures, both of which could be run locally with pre-commit or integration-tests.sh --inference-mode replay. All of the commits make it hard to figure out whether we really need re-recording using the action or whether we are unnecessarily re-recording.

@cdoern
Collaborator

cdoern commented Apr 9, 2026

because most of the recording changes here are not functional: e.g. updates to the OpenAI SDK version

@RobuRishabh
Contributor Author

I really feel like something must be wrong here since we are adding 157,000 lines..... most of which are removing, moving, and re-adding recordings. I dont want to stall this further but I really encourage PRs with less commits and that are tested locally before opening the PR in the future if possible. Thank you!

are you saying this wasn't tested locally?

@RobuRishabh I believe you did test this locally, right?

Yes, I did test this locally. I ran the responses integration tests with --inference-mode record initially, then, to minimize the diff, re-ran with record-if-missing against the ci-tests server config. All relevant tests passed; the only failures were pre-existing web search environment issues unrelated to my changes. However, since I didn't have Azure and OpenAI API keys, I wasn't initially able to figure out why the recording-dependent tests were failing. When you mentioned pulling the recordings from the branch and comparing them locally, I tried that, it worked, and the new recordings were generated successfully. The rest of the tests (unit tests, non-recording-dependent integration tests) were all run locally.

@cdoern cdoern added this pull request to the merge queue Apr 9, 2026
Merged via the queue into llamastack:main with commit 5d63477 Apr 9, 2026
67 checks passed
